Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double MultiIndex factorization in groupby index result #17644

Merged
merged 4 commits into from
Dec 30, 2024

Conversation

mroeschke
Copy link
Contributor

Description

From an offline discussion, it was discovered that a groupby with multiple keys experienced a slowdown somewhere between 24.02 and 24.12. An identified hot spot was the creation of the resulting MultiIndex where each resulting level undergoes factorization upon construction (which changed somewhere in between these releases to fix consistency bugs)

While the eager factorization is a new performance penalty, I identified that this factorization was being done twice unnecessarily when performing an agg. This PR ensures we avoid creating this MultiIndex until necessary and cache this MultiIndex result to avoid double factorization in the future

# PR
In [1]: import cudf
   ...: 
   ...: df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int})

In [2]:     target = "label"
   ...:     col = ['weekday', 'cat_2', 'brand']

In [3]: %%time
   ...: df_train[col + [target]].groupby(col).agg(['mean', 'count'])
   ...: 
   ...: 
# PR
CPU times: user 366 ms, sys: 109 ms, total: 474 ms
Wall time: 482 ms

#branch 25.02
CPU times: user 547 ms, sys: 112 ms, total: 659 ms
Wall time: 658 ms

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 20, 2024
@mroeschke mroeschke self-assigned this Dec 20, 2024
@mroeschke mroeschke requested a review from a team as a code owner December 20, 2024 19:19
@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 88d9251 into rapidsai:branch-25.02 Dec 30, 2024
106 checks passed
@mroeschke mroeschke deleted the perf/groupby/mi branch December 30, 2024 22:48
rapids-bot bot pushed a commit that referenced this pull request Jan 4, 2025
Noticed while working on #17644 that `diff` and `fillna` were make some unnecessary shallow copies of the `grouping.value` object. Also noticed that `_cov_or_corr` just pulled the column names out of `grouping.value` object, so made a separate API, `values_column_names` to just create the column names without pulling out the actual columns.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #17646
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants